Predicting the death by cancer rate for US counties

Simon Kennedy

Notebook 1 - Preprocessing

Problem definition

Predicting the death by cancer rate for US counties.

1. Loading the dataset

The original location of the file:

https://data.world/nrippner/ols-regression-challenge/file/cancer_reg.csv
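A minimal sketch of the loading step, assuming pandas. The helper name `load_cancer_data` is hypothetical, and the small inline CSV (with made-up values) stands in for the downloaded `cancer_reg.csv` so that the example runs without the file.

```python
import io

import pandas as pd


def load_cancer_data(source):
    """Read the cancer regression CSV from a path or a file-like object."""
    return pd.read_csv(source)


# Illustrative stand-in for cancer_reg.csv; the values are made up.
sample_csv = io.StringIO(
    "avgAnnCount,TARGET_deathRate,Geography\n"
    "1397.0,164.9,\"Kitsap County, Washington\"\n"
    "173.0,161.3,\"Kittitas County, Washington\"\n"
)

cancer_df = load_cancer_data(sample_csv)
print(cancer_df.shape)
print(cancer_df.head())
```

With the real file, the call would be `load_cancer_data("cancer_reg.csv")`. If pandas raises a decoding error, passing an explicit `encoding` argument to `pd.read_csv` is the usual remedy.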

Data Dictionary

TARGET_deathRate: Dependent variable. Mean per capita (100,000) cancer mortalities (a)

avgAnnCount: Mean number of reported cases of cancer diagnosed annually (a)

avgDeathsPerYear: Mean number of reported mortalities due to cancer (a)

incidenceRate: Mean per capita (100,000) cancer diagnoses (a)

medianIncome: Median income per county (b)

popEst2015: Population of county (b)

povertyPercent: Percent of populace in poverty (b)

studyPerCap: Per capita number of cancer-related clinical trials per county (a)

binnedInc: Median income per capita binned by decile (b)

MedianAge: Median age of county residents (b)

MedianAgeMale: Median age of male county residents (b)

MedianAgeFemale: Median age of female county residents (b)

Geography: County name (b)

AvgHouseholdSize: Mean household size of county (b)

PercentMarried: Percent of county residents who are married (b)

PctNoHS18_24: Percent of county residents ages 18-24 highest education attained: less than high school (b)

PctHS18_24: Percent of county residents ages 18-24 highest education attained: high school diploma (b)

PctSomeCol18_24: Percent of county residents ages 18-24 highest education attained: some college (b)

PctBachDeg18_24: Percent of county residents ages 18-24 highest education attained: bachelor's degree (b)

PctHS25_Over: Percent of county residents ages 25 and over highest education attained: high school diploma (b)

PctBachDeg25_Over: Percent of county residents ages 25 and over highest education attained: bachelor's degree (b)

PctEmployed16_Over: Percent of county residents ages 16 and over employed (b)

PctUnemployed16_Over: Percent of county residents ages 16 and over unemployed (b)

PctPrivateCoverage: Percent of county residents with private health coverage (b)

PctPrivateCoverageAlone: Percent of county residents with private health coverage alone (no public assistance) (b)

PctEmpPrivCoverage: Percent of county residents with employee-provided private health coverage (b)

PctPublicCoverage: Percent of county residents with government-provided health coverage (b)

PctPubliceCoverageAlone: Percent of county residents with government-provided health coverage alone (b)

PctWhite: Percent of county residents who identify as White (b)

PctBlack: Percent of county residents who identify as Black (b)

PctAsian: Percent of county residents who identify as Asian (b)

PctOtherRace: Percent of county residents who identify in a category which is not White, Black, or Asian (b)

PctMarriedHouseholds: Percent of married households (b)

BirthRate: Number of live births relative to number of women in county (b)

(a): years 2010-2016

(b): 2013 Census Estimates

https://data.world/nrippner/ols-regression-challenge

Checking data types
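One way to inspect the types, sketched on a small illustrative frame (the values are made up and only three of the dataset's columns are shown). The variable name `categorical_cols` is an assumption for this example.

```python
import pandas as pd

# Illustrative frame with a few of the dataset's columns; values are made up.
cancer_df = pd.DataFrame({
    "avgAnnCount": [1397.0, 173.0],
    "popEst2015": [260131, 43269],
    "binnedInc": ["(61494.5, 125635]", "(48021.6, 51046.4]"],
})

# One dtype per column.
print(cancer_df.dtypes)

# Columns that are not numeric are the categorical/text ones.
categorical_cols = cancer_df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```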

2. Pre-processing the dataset

a. Cleaning

Creating a copy of cancer_df to make changes to

All variables are the correct type

If they were not then the astype() function could be used. E.g. if I preferred the 'avgAnnCount' to be an integer rather than a float, I would use the code below:

cancer_1.avgAnnCount = cancer_1.avgAnnCount.astype( 'int64' )

The categorical variables will be converted to numerical before the modelling process, as further cleaning is necessary on these variables before they are ready for use.

Exploring the dataset

Some variables have a count below 3047 (the number of rows), which indicates missing or na values.

Several variables also contain obvious potential outliers.

Checking for missing values

PctSomeCol18_24 has 2285 na values. This is 75% of the data. It is best to remove this variable.

PctEmployed16_Over has 152 na values, which is only 5% of the data. The na data will be replaced with the column mean.

PctPrivateCoverageAlone has 609 na values, which is 20% of the data. The na data will be replaced with the column mean.
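The counts and percentages above can be produced along these lines. This is a sketch on a four-row illustrative frame with made-up values; the names `na_counts`, `na_percent` and `missing` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the real data has 3047 rows.
cancer_1 = pd.DataFrame({
    "PctSomeCol18_24": [np.nan, np.nan, np.nan, 42.1],
    "PctEmployed16_Over": [51.9, np.nan, 55.9, 45.9],
    "PctPrivateCoverage": [75.1, 70.2, 64.0, 58.4],
})

# Number of na values per column, and the same as a percentage of all rows.
na_counts = cancer_1.isna().sum()
na_percent = (na_counts / len(cancer_1) * 100).round(1)

missing = pd.DataFrame({"na_count": na_counts, "na_percent": na_percent})

# Show only the columns that actually contain missing values.
print(missing[missing["na_count"] > 0])
```

On the full data the same arithmetic gives 2285 / 3047 ≈ 75%, 152 / 3047 ≈ 5% and 609 / 3047 ≈ 20%.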

Dropping the variable 'PctSomeCol18_24' and replacing the na values with the variable mean in 'PctEmployed16_Over' and 'PctPrivateCoverageAlone'
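A sketch of this step on a small illustrative frame (made-up values), assuming the working copy is called `cancer_1` as in the `astype()` example earlier.

```python
import numpy as np
import pandas as pd

# Illustrative frame with made-up values.
cancer_1 = pd.DataFrame({
    "PctSomeCol18_24": [np.nan, np.nan, np.nan, 42.1],
    "PctEmployed16_Over": [50.0, np.nan, 60.0, 40.0],
    "PctPrivateCoverageAlone": [np.nan, 50.0, 40.0, 60.0],
})

# Drop the mostly empty column.
cancer_1 = cancer_1.drop(columns=["PctSomeCol18_24"])

# Replace the remaining na values with each column's mean
# (Series.mean() skips na values by default).
for col in ["PctEmployed16_Over", "PctPrivateCoverageAlone"]:
    cancer_1[col] = cancer_1[col].fillna(cancer_1[col].mean())

print(cancer_1)
```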

Comment:

Creating another copy of the dataframe

Histogram plots of variables with potential outliers

'Average annual count' - mean number of reported cases of cancer diagnosed annually

There are 2 obvious outliers at approx. 25000 and 38150. Let's remove these values.

Mean deaths per year

There are 2 obvious outliers at approx. 9200 and 14010. Let's remove anything above 6000.

Per capita number of cancer-related clinical trials per county

There are 3 obvious outliers at approx. 6800, 9500 and 9762. Let's remove any entries greater than 6000.

Median Age

Obvious outliers appear in this variable. Let's filter out values > 100.

Birth rate - Number of live births relative to number of women in county

There is an obvious outlier at 21.36. Let's remove this.
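The outlier filters described in this section could be combined into a single boolean mask, as sketched below on made-up rows. The cut-offs of 6000 and 100 come from the text; the cut-offs used for avgAnnCount (20000) and BirthRate (20) are assumptions, since the text only names the outlying values. The frame name `cancer_2` is also an assumption.

```python
import pandas as pd

# Illustrative rows; each of rows 1-4 contains one of the outliers described above.
cancer_2 = pd.DataFrame({
    "avgAnnCount":      [1397.0, 38150.0, 173.0, 102.0, 427.0, 250.0],
    "avgDeathsPerYear": [469, 14010, 70, 50, 202, 97],
    "studyPerCap":      [499.7, 0.0, 9762.3, 0.0, 0.0, 23.1],
    "MedianAge":        [39.3, 36.3, 33.0, 458.4, 45.0, 41.2],
    "BirthRate":        [6.1, 4.3, 3.7, 4.6, 21.36, 5.2],
})

keep = (
    (cancer_2["avgAnnCount"] < 20000)          # assumed cut-off
    & (cancer_2["avgDeathsPerYear"] <= 6000)   # from the text
    & (cancer_2["studyPerCap"] <= 6000)        # from the text
    & (cancer_2["MedianAge"] <= 100)           # from the text
    & (cancer_2["BirthRate"] < 20)             # assumed cut-off
)

# Keep only rows that pass every filter.
cancer_2 = cancer_2[keep].reset_index(drop=True)
print(len(cancer_2))
```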

Creating another copy

b. Dealing with categorical variables

Split 'Geography' column into 'County' and 'State' by delimiter

Dropping the 'Geography' column as it is no longer necessary. 'County' is simply an ID column, so it will be removed as well.
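A sketch of the split and the two drops, assuming the 'Geography' values have the form "County name, State name" separated by a comma and a space. The frame name `cancer_3` and the sample rows are illustrative.

```python
import pandas as pd

# Illustrative rows; the delimiter ", " is an assumption about the format.
cancer_3 = pd.DataFrame({
    "Geography": ["Kitsap County, Washington", "Kittitas County, Washington"],
    "TARGET_deathRate": [164.9, 161.3],
})

# Split once on ", " so that any further commas would stay in the state part.
cancer_3[["County", "State"]] = cancer_3["Geography"].str.split(", ", n=1, expand=True)

# 'Geography' is now redundant and 'County' is only an identifier.
cancer_3 = cancer_3.drop(columns=["Geography", "County"])
print(cancer_3)
```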

Re-formatting the 'binnedInc' elements to Levels 1 - 10 to make it easier to interpret

Level 1 being the lowest income group and Level 10 the highest income group.

The elements will be purely numerical, i.e. integers, so that they can be used in the modelling process.
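One possible way to do this re-formatting, sketched under the assumption that each 'binnedInc' value is an interval string such as "(61494.5, 125635]". The lower bound of each interval is extracted and then ranked, so the lowest bin becomes 1. The sample uses only three distinct bins.

```python
import pandas as pd

# Illustrative values; the interval-string format is an assumption.
cancer_3 = pd.DataFrame({
    "binnedInc": [
        "(61494.5, 125635]",
        "[22640, 34218.1]",
        "(34218.1, 37413.8]",
        "[22640, 34218.1]",
    ],
})

# Lower bound of each interval, e.g. "(61494.5, 125635]" -> 61494.5
lower = cancer_3["binnedInc"].str.extract(r"([\d.]+)", expand=False).astype(float)

# Dense rank of the lower bounds: lowest bin -> 1, next bin -> 2, and so on.
cancer_3["binnedInc"] = lower.rank(method="dense").astype(int)
print(cancer_3["binnedInc"].tolist())
```

With all ten decile bins present, the same ranking yields the integer levels 1 to 10.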

c. Correlations

Pair plot

This is too crowded to be very useful. Feature names are unreadable.

Correlation Heatmap

There are no highly correlated features

Further investigation into feature importance using the wrapper method with backward elimination

Only 1 variable has a p-value lower than 0.05, so this does not inform the feature selection process any further.

Target correlations to inform which variables to drop due to collinearity.

There are several features that are highly correlated with each other (i.e. corr > 0.8). They are listed below with their correlation to TARGET_deathRate.

For each group, the feature with the highest correlation to TARGET_deathRate will be kept and the others dropped, apart from the MedianAge group. There, both MedianAgeMale and MedianAgeFemale will be kept and MedianAge dropped, so that both genders stay represented and potential bias is reduced.
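A sketch of how such pairs can be found, using synthetic data in place of the real frame. The names `df` and `high_pairs` are assumptions for the example; the 0.8 threshold is the one quoted above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: two nearly identical age features and one independent feature.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 5, n)
df = pd.DataFrame({
    "MedianAgeMale": age + rng.normal(0, 0.5, n),
    "MedianAgeFemale": age + rng.normal(0, 0.5, n),
    "povertyPercent": rng.normal(17, 6, n),
})
df["TARGET_deathRate"] = 180 + 2 * df["povertyPercent"] + rng.normal(0, 10, n)

target = "TARGET_deathRate"
corr = df.corr()
features = [c for c in df.columns if c != target]

# Every feature pair with |corr| > 0.8, plus each feature's correlation with the target.
high_pairs = []
for i, a in enumerate(features):
    for b in features[i + 1:]:
        if abs(corr.loc[a, b]) > 0.8:
            high_pairs.append((a, b, corr.loc[a, target], corr.loc[b, target]))

for a, b, corr_a, corr_b in high_pairs:
    print(f"{a} ({corr_a:.2f}) vs {b} ({corr_b:.2f})")
```

Within each printed pair, the feature whose correlation with the target has the larger absolute value is the candidate to keep.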

Feature selection

Storing the target and the features that will feed the model

Making another copy first

Renaming the dataset

Saving the data with the to_csv function, for use in each of the different models

d. Normalisation

Normalisation will be done as part of the model pipeline in each notebook using MinMaxScaler( )
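A minimal sketch of what such a pipeline might look like with scikit-learn. The data is synthetic, and `LinearRegression` is only a placeholder estimator; the models actually used in the other notebooks are not specified here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 1000.0])
y = X @ np.array([2.0, 0.1, 0.01]) + rng.normal(scale=0.1, size=100)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),      # rescales each feature to the range [0, 1]
    ("model", LinearRegression()),   # placeholder estimator (assumption)
])

# Fitting the pipeline fits the scaler and the model on the same data.
pipe.fit(X, y)

# The fitted scaler can be inspected through named_steps.
X_scaled = pipe.named_steps["scaler"].transform(X)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

Keeping the scaler inside the pipeline means it is fitted only on the data passed to `fit`, so the test set does not leak into the scaling parameters.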

e. PCA

Data Segregation

It looks like approximately 9 components explain at least 90% of the variance. Let's check this.

Actually 11 components are required to explain at least 90% of the variance.
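The number of components needed for a given share of variance can be read off the cumulative explained variance ratio, as sketched below on synthetic data. Scaling with `MinMaxScaler` before PCA is an assumption here, and the names `cum_var` and `n_components` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix (300 rows, 20 correlated features).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))

X_scaled = MinMaxScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratio.
pca = PCA().fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%.
n_components = int(np.argmax(cum_var >= 0.90) + 1)
print(n_components, cum_var[n_components - 1])

# Project the data onto that many principal components.
X_pca = PCA(n_components=n_components).fit_transform(X_scaled)
```

scikit-learn also accepts a fraction, e.g. `PCA(n_components=0.90)`, which selects the number of components automatically.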

New dataframe of 11 PCs

0.007% lost due to PCA

Establishing a baseline (6.a.i.)

Dummy regressor baseline

Splitting the data first and using a seed for consistency throughout the project
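A sketch of a dummy baseline on synthetic data. The seed value, the 80/20 split and the RMSE metric are assumptions for the example; the notebook's own choices are not stated in the text.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

SEED = 42  # assumed value; any fixed integer gives a reproducible split

# Synthetic stand-in for the features and the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 180 + 10 * X[:, 0] + rng.normal(scale=5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

# The dummy regressor ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = baseline.predict(X_test)

rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(f"Baseline RMSE: {rmse:.2f}")
```

Any real model should beat this score; reusing the same `random_state` in every notebook keeps the train/test split identical across models.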